Cloud Native Tenets #
This section describes the core tenets that define how cloud-native applications are designed and built.
Moving to the cloud is a natural evolution of focusing on software, and cloud-native application architectures are at the center of how companies address its unique challenges. By cloud, we mean any computing environment in which computing, networking, and storage resources can be provisioned and released elastically in an on-demand, self-service manner. This definition includes both public cloud infrastructure (such as Amazon Web Services, Google Cloud, or Microsoft Azure) and private cloud infrastructure (such as VMware vSphere or OpenStack).
Speed #
Businesses that are able to innovate, experiment, and deliver software-based solutions quickly are outcompeting those that follow more traditional delivery methods.
Currently, as with many of our industry competitors, the time it takes to provision new application environments and deploy new versions of software is typically measured in days or weeks. This lack of speed severely limits the risk that can be taken on by any one release, because the cost of making and fixing a mistake is measured on that same timescale. Consider, as an example, the 9.0 release: the number of people swarmed onto bug fixes, the negative industry feedback, the total man-hour cost, and the damage to our reputation and market story. Similar problems occur, to a lesser extent, in every release where we take on aggressive features.
Internet companies such as Google, Microsoft, and Amazon are often cited for their practice of deploying hundreds of times per day. Frequent deployments are important because if you deploy at a high frequency, you can recover from mistakes almost instantly. And if you can recover from mistakes almost instantly, you can take on more risk and more features – which just might turn into our next competitive advantage.
It is the elastic and self-service nature of cloud-based infrastructure that lends itself to this way of working. Provisioning a new application environment can be made as simple as a single call to a cloud service API. Deploying code to that new environment via another API call adds even more speed. Ultimately, this means that we could deploy a fix or change in minutes or seconds.
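To make this concrete, here is a minimal sketch of single-call provisioning using boto3, the AWS SDK for Python; the AMI ID, region, and tag values are hypothetical placeholders, and any cloud provider with a comparable API would serve equally well:

```python
import boto3  # AWS SDK for Python; one example of a cloud provider API

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is illustrative

# Provision a new application environment with a single API call.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical machine image
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "env", "Value": "feature-test"}],  # placeholder tag
    }],
)
print(response["Instances"][0]["InstanceId"])  # the new environment's ID
```

Deploying code to that instance is a second API call of the same shape, which is what collapses the provision-and-deploy cycle from days to minutes.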
Safety #
Cloud-native application architectures must balance the ability to move and deploy rapidly with stability, availability, and durability.
As already mentioned, cloud-native application architectures enable us to recover rapidly from mistakes. To date, none of the industry's software efforts have produced a consistently measurable reduction in the number of defects that make it into production. In fact, many of the efforts in this area – exhaustive documentation, architectural reviews, and lengthy regression testing cycles – actually get in the way of the speed we would like to achieve.
How do we go fast and stay safe?
Visibility #
Alerts and Early Warning
Our architectures must provide us with the tools necessary to see failure when it happens. We need the ability to measure key system metrics, establish profiles for “what is normal,” detect deviations from the norm (including rate of change), and identify the components contributing to those deviations.
Feature-rich metrics, monitoring, alerting, and data visualization frameworks and tools are at the heart of all cloud-native architectures.
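As a minimal illustration of detecting deviations from "what is normal," the sketch below keeps a rolling baseline of recent samples and flags any new sample more than a few standard deviations away; the window size and threshold are illustrative assumptions, and a real deployment would lean on a dedicated monitoring stack rather than hand-rolled code:

```python
from collections import deque
from statistics import mean, stdev

class DeviationDetector:
    """Flag samples that deviate from a rolling baseline of normal
    behavior by more than `threshold` standard deviations."""

    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.samples = deque(maxlen=window)  # rolling window of recent values
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it deviates from the norm."""
        anomalous = False
        if len(self.samples) >= 2:  # stdev needs at least two data points
            baseline, spread = mean(self.samples), stdev(self.samples)
            if spread > 0 and abs(value - baseline) > self.threshold * spread:
                anomalous = True
        self.samples.append(value)
        return anomalous

detector = DeviationDetector()
for latency_ms in [12, 14, 11, 13, 12, 250]:  # the final spike is abnormal
    if detector.observe(latency_ms):
        print(f"alert: latency {latency_ms}ms deviates from the norm")
```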
Fault Isolation #
Reducing Blast Radius
In order to limit the risk associated with a failure, we need to limit the scope of the components or features that a failure could affect. If no one could use the SpheraCloud Suite of products every time a single feature like ORM went down, that would be disastrous.
By composing Sphera systems from microservices (the approach most often employed in cloud-native applications), we can limit the scope of a failure in any one microservice to just that microservice – provided we also build in fault tolerance.
Fault Tolerance #
Surviving the Shock
It is not enough to decompose a system into independently deployable components; we must also ensure that a failure in one of those components cannot cause a cascading failure across its possibly many dependencies.
While there are many patterns that deal with this, the most popular is the circuit breaker. A software circuit breaker is analogous to an electrical circuit breaker: it prevents a cascading failure by opening the path between the component it protects and the remainder of the failing system. It also typically provides graceful fallback behavior while the circuit breaker is open.
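A minimal sketch of the pattern in Python follows; the failure threshold, reset timeout, and fallback handling are illustrative assumptions, and a production system would typically reach for an established resilience library instead:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, then half-opens after `reset_timeout` seconds."""

    def __init__(self, max_failures: int = 3, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp of when the circuit opened

    def call(self, func, *args, fallback=None, **kwargs):
        # While open, short-circuit to the fallback until the timeout expires.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback
            self.opened_at = None  # half-open: allow a single trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker open
            return fallback
        self.failures = 0  # a success closes the circuit again
        return result
```

A caller would wrap a risky remote call as `breaker.call(fetch_prices, fallback=cached_prices)`, so a failing dependency degrades to stale data instead of dragging its consumers down with it.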
Automated Recovery #
Healing
With visibility, fault isolation, and fault tolerance, we have the tools we need to identify failure, recover from failure, and provide a reasonable level of service to our customers while we’re engaging in the process of identification and recovery.
Some failures are easy to identify; they present the same easily identifiable pattern every time they occur. Take, for example, a service health check, which typically has a binary answer: healthy or unhealthy, up or down. Many times we will take the same course of action every time we encounter failures like these. In the case of a failed health check, we’ll often simply restart or redeploy the service in question.
Cloud-native application architectures don’t wait for manual intervention in these situations. Instead, they employ automated detection and recovery.
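As a sketch of what that automation can look like, the loop below polls a hypothetical health endpoint and restarts the service when the check fails; the URL, service name, and use of systemctl are assumptions standing in for whatever supervisor or orchestrator the platform actually provides:

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical health endpoint
CHECK_INTERVAL = 10                          # seconds between checks

def is_healthy(url: str) -> bool:
    """Return True if the service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False

while True:
    if not is_healthy(HEALTH_URL):
        # Recovery action: restart via systemd here; an orchestrator such as
        # Kubernetes would instead replace the failed container for us.
        subprocess.run(["systemctl", "restart", "my-service"], check=False)
    time.sleep(CHECK_INTERVAL)
```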
Scale #
As demand increases, we need to scale out capacity automatically to service that demand.
Traditionally, this meant handling more demand by scaling vertically: companies bought larger servers. This accomplished the goal of scaling, but slowly and at great expense. The next iteration was to increase the number of servers, installing services on every machine.
The reality in either case is that this leads to capacity planning based on peak usage forecasting: hardware is purchased and allocated based on the most computing power that will be needed at peak times. Many times, despite best efforts, this forecast was wrong, and available capacity was blown through during periods of heavy activity. More typically, however, clients are saddled with tens – and in some cases scores – of servers with mostly idle CPUs, resulting in poor utilization metrics and a high relative cost per user.
There are basically two strategies that companies use to approach this problem:
- Rather than scaling up, tackle scaling horizontally by spreading instances across a larger number of commodity machines, which are easier to acquire and can be deployed quickly.
- Improve the poor utilization of existing large servers by virtualizing several smaller servers in the same footprint and deploying multiple isolated workloads to them.
In the cloud, these two approaches converge.
The virtualization effort is delegated to the cloud provider (or cloud infrastructure), and the developer focuses on scaling applications horizontally across large numbers of cloud server instances. With the recent shift by cloud providers from virtual servers to containers as the unit of deployment, the door is opened to further innovation and cost-saving strategies.
Companies addressing software design in these environments require a much lower capital investment, and provisioning via API not only improves the speed of deployment but also maximizes the speed with which we can respond to changes in demand.
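As a rough sketch of that responsiveness, the control loop below resizes a fleet to keep average CPU near a target; `get_average_cpu` and `set_instance_count` are hypothetical placeholders for a metrics query and a provider scaling API, and every number in it is illustrative:

```python
import math
import time

TARGET_CPU = 0.60                    # desired average utilization
MIN_INSTANCES, MAX_INSTANCES = 2, 20

def get_average_cpu() -> float:
    """Placeholder: query the metrics system for fleet-wide CPU (0.0-1.0)."""
    return 0.75  # stubbed constant so the sketch runs standalone

def set_instance_count(n: int) -> None:
    """Placeholder: call the cloud provider's scaling API."""
    print(f"scaling fleet to {n} instances")

instances = MIN_INSTANCES
while True:
    # Proportional scaling: size the fleet so average CPU returns to target.
    desired = math.ceil(instances * get_average_cpu() / TARGET_CPU)
    instances = max(MIN_INSTANCES, min(MAX_INSTANCES, desired))
    set_instance_count(instances)
    time.sleep(60)  # reevaluate once a minute
```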
Another hallmark of cloud-native application architectures is the externalization of state to in-memory data grids, caches, and persistent object stores, while keeping the application or service itself essentially stateless. Stateless applications can be quickly created and destroyed, as well as attached to and detached from external state managers, enhancing the speed with which the system can respond to changes in demand.
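As a brief sketch of this externalization, the handler below keeps a user's session state in Redis rather than in process memory, so any stateless replica can serve any request; the host name and cart schema are hypothetical, and redis-py is just one example of a client for an external state store:

```python
import redis  # redis-py client; assumes a reachable Redis server

# No session state lives in this process, so replicas are interchangeable
# and can be created or destroyed at will.
store = redis.Redis(host="cache.internal", port=6379)  # hypothetical host

def add_to_cart(session_id: str, item: str) -> int:
    """Append an item to the user's cart and return the new cart size."""
    key = f"cart:{session_id}"
    store.rpush(key, item)   # state is written to the external store
    store.expire(key, 3600)  # idle sessions expire after an hour
    return store.llen(key)
```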
The biggest challenge is that these benefits come with an inherent cost. Applications must be architected differently for horizontal rather than vertical scaling. The elastic nature of the cloud demands ephemerality.